Breast cancer causes the greatest number of cancer-related deaths among women. This year, an estimated 42,170 women will die from breast cancer in the U.S. (according to www.nationalbreastcancer.org). Applying prediction techniques to genetic data has the potential to give more accurate estimates of survival time and may help prevent unnecessary surgical and treatment procedures.
“The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) database is a Canada-UK Project which contains targeted sequencing data of 1,980 primary breast cancer samples. Clinical and genomic data was downloaded from cBioPortal.”
The dataset was collected by Professor Carlos Caldas from Cambridge Research Institute and Professor Sam Aparicio from the British Columbia Cancer Centre in Canada. Therefore, our population is women who have visited these cancer centres.
The dataset was obtained from ‘Kaggle’ and contains 693 variables and 1,904 observations. The file includes 31 clinical attributes, mRNA-level z-scores for 331 genes, and mutation data for 175 genes for 1,904 breast cancer patients. This data is representative of my population because it includes patients who have been examined for breast cancer at varying levels recently, in the year 2020. For the purpose of this analysis, I will reduce the number of variables to the 31 most relevant to the modeling. Observations containing ‘NA’ values were then removed so that no missing data remains in the dataset.
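The two preparation steps described above (keeping a subset of variables, then dropping observations with missing values) can be sketched as follows. This is a minimal illustration using pandas, not the code used for this analysis: the helper name `reduce_and_drop_missing`, the toy table, and all column names are hypothetical stand-ins for the real dataset.

```python
import numpy as np
import pandas as pd


def reduce_and_drop_missing(df: pd.DataFrame, keep_columns: list[str]) -> pd.DataFrame:
    """Keep only the chosen columns, then drop rows with any missing value."""
    subset = df.loc[:, keep_columns]
    # how="any" removes a row as soon as one of the kept columns is missing.
    return subset.dropna(axis=0, how="any").reset_index(drop=True)


# Toy stand-in for the real table; column names and values are hypothetical.
toy = pd.DataFrame(
    {
        "age_at_diagnosis": [55.0, 61.2, np.nan, 47.8],
        "tumor_size": [22.0, np.nan, 15.0, 30.0],
        "overall_survival_months": [140.5, 84.6, 163.7, 20.1],
        "gene_a_zscore": [0.3, -1.2, 0.8, np.nan],
    }
)

# Assumed selection of "relevant" variables for the illustration.
selected = ["age_at_diagnosis", "tumor_size", "overall_survival_months"]
clean = reduce_and_drop_missing(toy, selected)
print(clean.shape)
```

Selecting the columns before dropping rows matters: a missing value in a discarded column (here `gene_a_zscore`) no longer costs an observation, so fewer patients are lost than if rows were dropped from the full table first.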